Add device ID validation to prevent cryptic GPU failures#2712
Add device ID validation to prevent cryptic GPU failures#2712
Conversation
|
@ggorman 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs. I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
|
@copilot - CondGe does not do what you think it does; see github actions failures. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Fixed the CondGe import error in commit b03b63d. The issue was that CondGe doesn't exist in devito.symbolics - I've replaced it with Ge from devito.types.relational which provides the correct >= comparison functionality for conditionals. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
|
as usual, this is a mess |
|
@copilot - pull in .github/copilot-instructions.md and use the additional knowledge from that file to revise your patch. |
Co-authored-by: ggorman <5394691+ggorman@users.noreply.github.com>
Fixed in commit afbd708. Simplified the implementation to follow Devito's established patterns - removed unnecessary complexity, shortened error messages, used proper imports, and ensured 90-character line compliance. The validation logic is now much cleaner while preserving the core functionality.
Revised the implementation using the copilot instructions from PR #2713 in commit afbd708. The new approach follows Devito's coding standards: uses SymPy directly, implements minimal changes, maintains 90-character line limits, and uses established patterns for device handling. The solution is now much cleaner while preserving the device validation functionality. |
|
closing because it's not in a mergeable state and it hasn't received contributions recently |
This PR adds device existence validation to GPU operators to prevent cryptic failures when CUDA_VISIBLE_DEVICES and container GPU configuration are mismatched.
Problem
When using Docker with both
--env CUDA_VISIBLE_DEVICES=1and--gpus "device=1", the container runtime only exposes device 0 but renumbers it from the original device 1. This causes Devito to attempt using device ID 1 (from CUDA_VISIBLE_DEVICES) on a system that only has device 0 available, leading to opaque failures like:Solution
Added device validation logic that generates runtime checks before calling
acc_set_device_num()oromp_set_default_device(). The generated code now includes:OpenACC validation:
OpenMP validation:
Benefits
Implementation Details
The solution follows Devito's established patterns:
Leveraged existing infrastructure: Uses the existing 'num-devices' mappers for generating appropriate runtime calls (
omp_get_num_devicesoracc_get_num_devices)Minimal code changes: Enhanced device setup paths to include validation logic before device assignment, following Devito's 90-character line limit and coding standards
Proper integration: Works seamlessly with both sequential and MPI device setup configurations
Changes
devito/passes/iet/langbase.pyto add concise validation logic using established patternstests/test_gpu_openmp.pyandtests/test_gpu_openacc.pyto validate the new behaviorFixes #2711.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.